Abstract

In most sampling algorithms, including Hamiltonian Monte Carlo, transition rates between states correspond to the probability of making a transition in a single time step, and are constrained to be less than or equal to 1. We derive a Hamiltonian Monte Carlo algorithm using a continuous time Markov jump process, and are thus able to escape this constraint. Transition rates in a Markov jump process need only be non-negative. We demonstrate that the new algorithm leads to improved mixing for several example problems, both by evaluating the spectral gap of the Markov operator, and by computing autocorrelation as a function of compute time.

We release the algorithm as an open source 
Python package. 


1 Introduction 

Efficient sampling is a challenge in many tasks involving high dimensional probabilistic models, in a diversity of fields. For example, sampling is commonly required to train a probabilistic model, to evaluate the model’s performance, to perform inference, and to take expectations under the model [1].

In this paper we introduce a method for more efficient 
sampling, by making Markov transitions in continuous 
rather than discrete time. This allows transitions into 
lower probability states to occur more often, with a 
shorter time spent for each visit, and thus allows for 
more rapid exploration of state space. We apply this 
approach to develop a novel Hamiltonian Monte Carlo 
(HMC) sampling algorithm. Finally, we demonstrate 
the effectiveness of this approach by comparing both 
spectral gaps and autocorrelation on several example 
problems. 
1.1 Discrete time sampling

Most sampling algorithms involve transitioning between states in discrete time steps, with a fixed interval. In this discrete-time framework, the transition rates out of a state must sum to 1 and be non-negative. In fact, the popular Metropolis-Hastings acceptance rule [2] for Markov Chain Monte Carlo (MCMC) works well because it maximizes the transition rate between a pair of states, subject to this constraint and to detailed balance. As we will see, however, this constraint on transition rates limits performance, and better mixing can be achieved by allowing transition rates larger than 1.

1.2 Markov jump process 

A Markov process can also be expressed in continuous time, in which case the only restriction on the transition rates between distinct states is that they be non-negative. In continuous time, the rate of transition from a state $j$ into a state $i \neq j$ is given by $\Gamma_{ij}$, and the rate of change of the probability $p_i$ of state $i$ is

$$\frac{dp_i}{dt} = \sum_j \Gamma_{ij}\, p_j, \tag{1}$$

where $\Gamma_{ij} \geq 0$ for all $i \neq j$, and we use the convention $\Gamma_{jj} = -\sum_{i \neq j} \Gamma_{ij}$.

If a particle evolves in a system of this form it makes stochastic transitions between a set of discrete states in continuous time. Each transition is governed by a Poisson process. Neglecting other states, the waiting time $w_{ij}$ for a transition from state $j$ to state $i$ will be drawn from an exponential distribution, $P(w_{ij}) = \Gamma_{ij} \exp(-\Gamma_{ij} w_{ij})$.

To simulate the system, waiting times $w_{kj}$ are generated for all candidate states $k$, and the state with the shortest waiting time, $i = \operatorname{argmin}_k w_{kj}$, is chosen. We call this shortest waiting time the holding time. A transition is then performed to state $i$ after a delay of length $w_{ij}$.
A system that evolves in this way is known as a Markov jump process [3]. Markov jump processes have been used to model physical systems, such as chemical reactions, which are well described by stochastic transitions between discrete states [4]. Work has also been done on efficient sampling of trajectories in Markov jump processes [5] and the statistical properties of these trajectories [6].
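The simulation procedure described above can be sketched in a few lines of Python. This is a minimal illustration rather than the released package: the three-state target, the rate choice $\Gamma_{ij} = \sqrt{\pi_i / \pi_j}$ (which satisfies detailed balance), and the function name `simulate_jump_process` are all assumptions made for the example.

```python
import numpy as np

def simulate_jump_process(rates, start, n_jumps, rng):
    """Simulate a Markov jump process with competing exponential clocks.

    rates[i, j] is the rate of jumping from state j into state i (i != j);
    the diagonal is ignored. Returns visited states and holding times.
    """
    n_states = rates.shape[0]
    states = np.empty(n_jumps, dtype=int)
    holding_times = np.empty(n_jumps)
    j = start
    for step in range(n_jumps):
        out_rates = rates[:, j].copy()
        out_rates[j] = 0.0
        # One waiting time per candidate state; zero-rate targets never fire.
        waits = np.full(n_states, np.inf)
        active = out_rates > 0
        waits[active] = rng.exponential(1.0 / out_rates[active])
        i = int(np.argmin(waits))
        states[step] = j
        holding_times[step] = waits[i]  # shortest wait is the holding time in j
        j = i
    return states, holding_times

# Hypothetical toy target and rates (assumptions for illustration only).
target = np.array([0.6, 0.3, 0.1])
rates = np.sqrt(target[:, None] / target[None, :])  # rates[i, j] = sqrt(pi_i / pi_j)
rng = np.random.default_rng(0)
states, holding_times = simulate_jump_process(rates, start=0, n_jumps=50_000, rng=rng)

# Fraction of system time spent in each state; should approach the target.
occupancy = np.bincount(states, weights=holding_times, minlength=3) / holding_times.sum()
print(occupancy)
```

Weighting each visited state by its holding time is what turns the sequence of visited states into an estimate of the target distribution.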

To our knowledge, Markov jump processes have not previously been applied to general purpose Monte Carlo sampling, though see [7] where a Markov jump process is used to sample from a posterior distribution over model graph structure. One barrier to using a Markov jump process for general Monte Carlo is that it is necessary to compute transition rates to all possible target states $i$ from the current state $j$. As we will see, however, for HMC we only need to consider a small number of target states. The primary contribution of this paper is to use a Markov jump process to develop a faster HMC algorithm.

1.2.1 Relationship to importance sampling

As will be elaborated in Section 2.4, samples from a Markov jump process can be generated by using an equivalent discrete time process $\tilde{T}$ to generate the same distribution over state sequences, and then resampling according to each state’s holding time. From this perspective, the process of sampling from a Markov jump process can be seen as a realization of importance sampling, with a particularly unusual proposal distribution. The equivalent discrete time process $\tilde{T}$ defines the importance sampling proposal distribution, and the holding times provide the importance weights.

The discrete time distribution $\tilde{p}$ generated by $\tilde{T}$ will tend to be more similar to a uniform distribution than $p$, and the corresponding Markov chain $\tilde{T}$ will thus tend to mix more quickly than a typical discrete time sampler. For instance, $\tilde{T}$ has no self-transitions, so unlike in a standard Metropolis-Hastings algorithm there is no sample rejection, and as a result there is likely to be less wasted computation. Qualitatively, the equivalent discrete time process $\tilde{T}$ can be expected to visit low probability states far more frequently than an unweighted sampler. Those states will just have very short holding times, and be assigned very small importance weights. This will allow it to more rapidly explore the state space.

1.3 Hamiltonian Monte Carlo 

Hamiltonian Monte Carlo (HMC) [8, 9] is the state-of-the-art, general purpose Monte Carlo algorithm for sampling from a distribution $\pi(x)$ over a continuous state space $x \in \mathbb{R}^N$. HMC utilizes the same Hamiltonian dynamics that govern the evolution of a physical system (for instance, a marble rolling in a swimming pool) to rapidly traverse long distances in state space. In HMC the state space is first extended to include auxiliary momentum variables $v \in \mathbb{R}^N$ with distribution $\pi(v)$, such that the joint state space over position and momentum is $\zeta = \{x, v\}$, with joint distribution $\pi(\zeta) = \pi(x)\,\pi(v)$. An analogy is then made between $x$ and physical position (e.g. the position of the marble), between $v$ and physical momentum (the momentum of the marble), and between $-\log \pi(x)$ and potential energy (the height of the swimming pool at position $x$). Since physical dynamics conserve energy, they can generate very long trajectories in state space while remaining on a constant probability contour of $\pi(\zeta)$. HMC is thus able to move very long distances in state space in a single update step.

Figure 1: (a) The action of operators involved in Hamiltonian Monte Carlo (HMC). The base of each red or green arrow represents the position $x$, and the length and direction of each of these arrows represents the momentum $v$. The flip operator $F$ reverses the momentum. The leapfrog operator $L$ approximately integrates Hamiltonian dynamics. The trajectory taken by $L$ is indicated by the dotted line. The randomization operator $R$ replaces the momentum. (b) The ladder of discrete states generated by the leapfrog ($L$) and flip ($F$) operators. Application of $F$ corresponds to movement across the ‘rungs’ of the ladder. Application of $L$ corresponds to movement up the right side of the ladder, or down the left. Inset arrows illustrate closed loops of constant total probability flow under our chosen rate (Equation 3).
1.3.1 Discrete state space ladder

As introduced in [10] and illustrated in Figure 1, HMC can be viewed in terms of transitions on a discrete state space ladder. This state ladder is formulated by expressing the action of HMC on a sampling particle in terms of three operators. The leapfrog integration operator, $L$, approximately integrates Hamiltonian dynamics for a fixed number of time steps and a fixed step length.

The momentum flip operator, $F$, reverses a particle’s direction of travel along a contour by flipping its momentum. The momentum randomization operator $R$ redraws the momentum vector from $\pi(v)$, and moves a particle onto a new state space ladder.

This perspective suggests a powerful formalism, effectively discretizing the state space, and it illuminates the structure¹ of HMC. Throughout this work, we refer to the structure generated by the operators as a state ladder. As illustrated in Figure 1, $L$ causes movement up the right side of a state ladder and down the left side, whereas $F$ causes horizontal movement across the rungs of a ladder. A trajectory can be exactly reversed by reversing the momentum, integrating Hamiltonian dynamics, and reversing the momentum again. As can be seen in Figure 1, this corresponds to making a loop on the state space ladder, and it implies that $FLFL = I$, where $I$ is the identity operator. $R$ causes movement off of the current state ladder and onto a new one. Both $F$ and $L$ are volume preserving, which will eliminate the need to consider volume changes when computing Markov transition rates.
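The identity $FLFL = I$ can be checked numerically. The sketch below assumes a standard Gaussian target with $E(x) = \frac{1}{2}|x|^2$ and unit Gaussian momentum; the function names and the step size are illustrative choices, not taken from the paper's package.

```python
import numpy as np

def grad_energy(x):
    # Assumed toy target: standard Gaussian, E(x) = 0.5 * |x|^2.
    return x

def leapfrog(x, v, eps=0.1, n_steps=10):
    """L: leapfrog integration of Hamiltonian dynamics for n_steps steps."""
    v = v - 0.5 * eps * grad_energy(x)      # initial half step in momentum
    for step in range(n_steps):
        x = x + eps * v                     # full step in position
        last = step == n_steps - 1
        # Full momentum steps in the interior, half step at the very end.
        v = v - (0.5 if last else 1.0) * eps * grad_energy(x)
    return x, v

def flip(x, v):
    """F: reverse the momentum."""
    return x, -v

rng = np.random.default_rng(0)
x0, v0 = rng.standard_normal(3), rng.standard_normal(3)

# Apply L, then F, then L, then F: the particle should return to its start.
x1, v1 = flip(*leapfrog(*flip(*leapfrog(x0, v0))))
print(np.max(np.abs(x1 - x0)), np.max(np.abs(v1 - v0)))
```

The round trip is exact up to floating point error because leapfrog integration is time reversible, even though it conserves energy only approximately.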

If we only allow transitions between states that are connected on the state ladder (Figure 1), then transitions can only occur between $\zeta$ and three other states $\{F\zeta, L\zeta, L^{-1}\zeta\}$. This makes HMC well matched to Markov jump processes, since only a small number of transitions need be considered.

1.3.2 Current research 

The development of improved methods for performing HMC is an important area of active research. These include the use of shadow Hamiltonians that are more closely conserved by the approximate Hamiltonian integrator [11], Riemann manifold HMC [12] and other investigation of its geometry [13], quasi-Newton HMC [14], Hilbert space HMC [15], parameter adaptation techniques [16], Hamiltonian annealed importance sampling [17], split HMC [18], tempered trajectories [9], novel discrete time transition rules [10, 19, 20], stochastic gradient variants on HMC [21], HMC for approximate Bayesian computation [22], and new approximate Hamiltonian integrators [23].

¹ Between momentum randomizations, HMC acts in a manner isomorphic to the Dihedral group of, in general, infinite order. The HMC state ladder, and thus the order, is generally infinite because trajectories through state space produced by Hamiltonian dynamics are almost never closed.

2 Markov jump Hamiltonian Monte 
Carlo 

We now discuss various aspects of our MJHMC sampler, schematized in Figure 2.

2.1 Continuous time transition rates on 
HMC state ladder 

A Markov process must satisfy two conditions [24] to sample from a target distribution $\pi(\zeta)$. The first is ergodicity, which requires that the process will eventually explore the full state space; this is typically straightforward to satisfy.

The second condition is that the target distribution must be a fixed point of the Markov process. This is usually achieved via dynamics that satisfy detailed balance matched to the state probabilities of the target distribution.

2.1.1 Closed loops preserve the fixed point 
distribution 

Markov transition rates $\Gamma(\zeta' \mid \zeta)$ that preserve $\pi(\zeta)$ as a fixed point can be constructed from closed loops of constant probability flow in state space.

For closed loops to have constant probability flow, the flow $r(\zeta_i \mid \zeta_j) = \Gamma(\zeta_i \mid \zeta_j)\,\pi(\zeta_j)$ from state $j$ to $i$ must be identical for each link in the loop. This is analogous to Kirchhoff’s current law for electrical circuits: constant probability flow in all loops implies that the net probability flow into any state equals the net flow out of that state.

2.1.2 Choosing transition rates 

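For the case of HMC, we set the loops to be between the “rungs” of the state ladder, as illustrated in Figure 1. The loop balance condition for each closed loop becomes

$$\tilde{r}(F\zeta \mid \zeta) = \tilde{r}(LF\zeta \mid F\zeta) = \tilde{r}(L^{-1}\zeta \mid LF\zeta) = \tilde{r}(\zeta \mid L^{-1}\zeta), \tag{2}$$

\begin{align}
\pi(\zeta)\,\tilde{\Gamma}(F\zeta \mid \zeta) &= \pi(F\zeta)\,\tilde{\Gamma}(LF\zeta \mid F\zeta) \nonumber \\
&= \pi(LF\zeta)\,\tilde{\Gamma}(L^{-1}\zeta \mid LF\zeta) \nonumber \\
&= \pi(L^{-1}\zeta)\,\tilde{\Gamma}(\zeta \mid L^{-1}\zeta). \tag{3}
\end{align}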
Figure 2: Illustration of Markov Jump HMC sampling dynamics. The red curve represents a particle trajectory for 400 time steps. Blue shading indicates the probability density, plotted at the left, and an empirical histogram of the samples is shown at the right. The inset blowup at the bottom shows how movement of the sampling particle corresponds to transitions on the state ladder, using the symbolic and graphical conventions described in Figure 1. Note that the sampling particle dwells in a position for a duration related to the probability density of that state relative to its neighbors. We have indicated the transition from one state ladder (vertical green ladder) to a new ladder (angled purple ladder) following a momentum randomization event, resulting in a new state.


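In order to satisfy this condition, we set the transition rates to be

$$\tilde{\Gamma}(\zeta' \mid \zeta) = \begin{cases} \left[\dfrac{\pi(L\zeta)}{\pi(\zeta)}\right]^{\frac{1}{2}} & \zeta' = L\zeta \\[2ex] \left[\dfrac{\pi(LF\zeta)}{\pi(F\zeta)}\right]^{\frac{1}{2}} & \zeta' = F\zeta \\[2ex] 0 & \text{otherwise} \end{cases} \tag{4}$$

One can verify by direct substitution that these transition rates satisfy Equation 3. The transition rates for the full ladder consist of a sum over the transition rates for each loop.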


2.1.3 Opposing flows cancel 

As can be seen in Figure 1, adjacent loops make flip transitions across the “rungs” of the ladder in opposite directions. After summing over all loops, the net transitions across the ladder approximately cancel.

This allows us to reduce the flip rates in both directions, such that the flip rate in one direction is zero. The final rate of flip transitions in our algorithm will thus be


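\begin{align}
\Gamma(F\zeta \mid \zeta) &= \tilde{\Gamma}(F\zeta \mid \zeta) - \min\left[\tilde{\Gamma}(F\zeta \mid \zeta),\, \tilde{\Gamma}(\zeta \mid F\zeta)\right] \tag{5} \\
&= \max\left[0,\, \tilde{\Gamma}(F\zeta \mid \zeta) - \tilde{\Gamma}(\zeta \mid F\zeta)\right] \tag{6} \\
&= \max\left[0,\, \left[\frac{\pi(LF\zeta)}{\pi(F\zeta)}\right]^{\frac{1}{2}} - \left[\frac{\pi(LFF\zeta)}{\pi(FF\zeta)}\right]^{\frac{1}{2}}\right] \tag{7} \\
&= \max\left[0,\, \left[\frac{\pi(L^{-1}\zeta)}{\pi(\zeta)}\right]^{\frac{1}{2}} - \left[\frac{\pi(L\zeta)}{\pi(\zeta)}\right]^{\frac{1}{2}}\right], \tag{8}
\end{align}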


where the final step relies on the observations that $FF\zeta = \zeta$, and that $\pi(\zeta) = \pi(F\zeta)$ [10]. In practice, $\pi(L^{-1}\zeta)$ will typically already be available (up to a shared normalization constant) from the preceding Markov transition, and will not need to be computed.

Due to discretization error, the leapfrog integrator for Hamiltonian dynamics only approximately conserves probability. Equation 8 shows that the residual flow across the ladder stems from this discretization error of the leapfrog integrator. This is completely analogous to the cause of momentum flips in standard HMC.

2.2 Momentum randomization 

In discrete time HMC the momentum is periodically corrupted with noise. If this were not done, then sampling would be restricted to a single state ladder, and mixing between state ladders would not occur. In order to accomplish the same end in continuous time, we jump to a state $R\zeta$ with a constant transition rate $\beta$. A transition to $R\zeta$ corresponds to replacing the momentum $v$ with a new draw from $\pi(v)$. The transition rate from $v$ to a particular $v'$ is thus $\beta\,\pi(v')$. It can be seen by substitution that this rate satisfies detailed balance.


2.3 Final transition rates 


Combining the transition rates derived in Sections 2.1.2, 2.1.3, and 2.2,

$$\Gamma(\zeta' \mid \zeta) = \begin{cases} \left[\dfrac{\pi(L\zeta)}{\pi(\zeta)}\right]^{\frac{1}{2}} & \zeta' = L\zeta \\[2ex] \max\left[0,\, \left[\dfrac{\pi(L^{-1}\zeta)}{\pi(\zeta)}\right]^{\frac{1}{2}} - \left[\dfrac{\pi(L\zeta)}{\pi(\zeta)}\right]^{\frac{1}{2}}\right] & \zeta' = F\zeta \\[2ex] \beta & \zeta' = R\zeta \\[1ex] 0 & \text{otherwise} \end{cases} \tag{9}$$

We verify that these transition rates satisfy the balance condition for $\pi(\zeta)$ in Appendix A. The third line is a slight abuse of notation since $R\zeta$ does not correspond to a single fixed state, but rather indicates that the momentum is replaced by a new draw from $\pi(v)$, where this replacement is triggered by a Poisson process with rate $\beta$. Note that as in [10] these dynamics do not satisfy detailed balance, and can be expected to mix more quickly as a result [25].
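As a minimal sketch of how the rates in Equation 9 might be evaluated in practice, the function below works with energies rather than probabilities, using $\pi(\zeta) \propto \exp(-H(\zeta))$ so that the unknown normalization constant cancels. The name `mjhmc_rates` and its arguments are hypothetical and are not taken from the released package.

```python
import numpy as np

def mjhmc_rates(H_cur, H_fwd, H_bwd, beta):
    """Transition rates out of zeta, following Equation 9.

    H_cur, H_fwd and H_bwd are -log pi (up to a shared constant) evaluated at
    zeta, L zeta and L^{-1} zeta respectively.
    Returns the rates for the leapfrog, flip and randomization transitions.
    """
    rate_L = np.exp(0.5 * (H_cur - H_fwd))      # [pi(L zeta) / pi(zeta)]^(1/2)
    rate_back = np.exp(0.5 * (H_cur - H_bwd))   # [pi(L^-1 zeta) / pi(zeta)]^(1/2)
    rate_F = max(0.0, rate_back - rate_L)       # residual flow across the rung
    return rate_L, rate_F, beta

# With a perfectly energy-conserving integrator the flip rate vanishes.
exact = mjhmc_rates(1.0, 1.0, 1.0, beta=0.1)
# If moving forward raises the energy, the leapfrog rate drops below 1 and
# the flip rate becomes positive.
uphill = mjhmc_rates(0.0, 2.0, 0.0, beta=0.1)
print(exact, uphill)
```

Note that the leapfrog rate exceeds 1 whenever the forward state has lower energy, which is exactly the freedom that a discrete time acceptance probability does not have.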


2.4 System time vs compute time 

We have described continuous time dynamics in terms of a system, or simulation, time. However, when applying this sampler to a real problem it is its performance as measured relative to compute time that matters. Here we show how to relate the continuous time dynamics of the Markov jump process to a discrete time Markov process, with an approximately fixed computational cost per time-step.

First we observe that there is a discrete time Markov 
process describing only the sequence of visited states, 
thus neglecting the holding time spent in each state. 


For notational convenience we represent Markov processes using matrix notation in this section. The update rule for this Markov process can be written

$$\tilde{p}^{\tau+1} = \tilde{T}\,\tilde{p}^{\tau}, \tag{10}$$

$$\tilde{T}_{ij} = \begin{cases} \dfrac{\Gamma_{ij}}{\sum_{k \neq j} \Gamma_{kj}} & i \neq j \\[2ex] 0 & i = j \end{cases} \tag{11}$$

where the matrix $\tilde{T}$ is the Markov transition kernel, and the vector $\tilde{p}^{\tau}$ is the probability distribution over system states at timestep $\tau$. The computational cost of each time step is roughly constant under this Markov chain, since each step requires computing the transition rate to all possible next states in Equation 4.

The current and fixed point distributions $\tilde{p}$ and $\tilde{\pi}$ under this process can be related to the corresponding distributions $p$ and $\pi$ under the Markov jump process by scaling by the expected holding time,

$$\pi = \frac{1}{Z} D \tilde{\pi}, \tag{12}$$

$$p = \frac{1}{Z} D \tilde{p}, \tag{13}$$

where $D$ is a diagonal matrix with the expected holding times for each state on the diagonal, $D_{jj} = \frac{1}{\sum_{i \neq j} \Gamma_{ij}}$, and $Z$ is a normalization constant [26]. Similarly, the evolution of $p$ relative to these discrete time steps can be expressed by scaling $\tilde{p}$ in Equation 10 by the holding time,


$$p^{\tau+1} = D \tilde{T} D^{-1} p^{\tau}, \tag{14}$$

$$p^{\tau+1} = T p^{\tau}, \tag{15}$$

where $T = D \tilde{T} D^{-1}$ describes the discrete time evolution of the samples. Since $T$ and $\tilde{T}$ are related to each other by a similarity transform, they share identical eigenvalues. In order to evaluate the spectral gap, and thus the mixing time, of the Markov jump process in terms of computational time, it is thus sufficient to compute the spectral gap of $\tilde{T}$. We do this for randomly generated toy systems in Section 3.1 and Figure 3 and show that MJHMC has superior spectral gap characteristics, indicating more efficient mixing.
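The relationships in Equations 10 to 15 can be checked numerically on a small looped state ladder. The sketch below is an illustration under stated assumptions, not the experiment code: it uses only the $L$ and $F$ transitions (no momentum randomization), draws one Gaussian energy per rung, and picks an odd number of rungs because, in this toy construction, an even number makes the embedded chain period-2.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 7                                   # number of rungs (assumed, odd)
pi_rung = np.exp(-rng.standard_normal(k))
n = 2 * k                               # side 0 moves up, side 1 moves down

def index(r, s):
    return (r % k) + k * s

# rates[i, j] = rate from state j into state i, built from Equation 9.
rates = np.zeros((n, n))
for r in range(k):
    for s in (0, 1):
        step = 1 if s == 0 else -1
        j = index(r, s)
        fwd = np.sqrt(pi_rung[(r + step) % k] / pi_rung[r])   # L transition
        bwd = np.sqrt(pi_rung[(r - step) % k] / pi_rung[r])   # via L^{-1} zeta
        rates[index(r + step, s), j] = fwd
        rates[index(r, 1 - s), j] = max(0.0, bwd - fwd)       # F transition

pi = np.concatenate([pi_rung, pi_rung])
pi /= pi.sum()

exit_rate = rates.sum(axis=0)
generator = rates - np.diag(exit_rate)  # continuous time operator
D = np.diag(1.0 / exit_rate)            # expected holding times
T_tilde = rates / exit_rate             # column j divided by exit_rate[j]
T = D @ T_tilde @ np.diag(exit_rate)    # T = D T_tilde D^{-1}

pi_tilde = pi * exit_rate
pi_tilde /= pi_tilde.sum()              # fixed point of the embedded chain

mags_T = np.sort(np.abs(np.linalg.eigvals(T)))
mags_T_tilde = np.sort(np.abs(np.linalg.eigvals(T_tilde)))
spectral_gap = 1.0 - mags_T[-2]
print(spectral_gap)
```

The checks of interest are that `generator @ pi` vanishes, that `T_tilde` leaves `pi_tilde` unchanged, and that rescaling `pi_tilde` by the holding times in `D` recovers `pi`.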

2.5 Algorithm 

Here, we briefly summarize the Markov Jump HMC algorithm for generating $N$ samples in pseudocode. As with all HMC sampling algorithms, an energy function $E(x)$ (equivalent to $-\log \pi(x)$ plus a constant) and its gradient are required. The three hyperparameters are the leapfrog step size $\epsilon$, the number of leapfrog steps per sampling step $M$, and the momentum corruption rate $\beta$.

Note that computation of $L^{-1}\zeta$ is only necessary when the last transition made was a momentum flip or randomization. The number of times the gradient is evaluated in an MJHMC sampling step is comparable to that of standard HMC.


Algorithm 1: Markov Jump Process HMC

input : $\epsilon$, $M$, $\beta$, $E(x)$, $\nabla E(x)$, $N$
output: $N$ samples
$\zeta_0 \leftarrow$ randomly initialized;
for $i \leftarrow 1$ to $N$ do
    Calculate states $L\zeta_{i-1}$, $F\zeta_{i-1}$, $L^{-1}\zeta_{i-1}$;
    Compute $\pi(\zeta_{i-1})$, $\pi(L\zeta_{i-1})$, $\pi(F\zeta_{i-1})$, $\pi(L^{-1}\zeta_{i-1})$;
    Compute transition rates $\Gamma_L$, $\Gamma_F$, $\Gamma_R$ using Equation (9);
    Draw waiting times $w_L$, $w_F$, $w_R$ from an exponential distribution, using rates of $\Gamma_L$, $\Gamma_F$, $\Gamma_R$ respectively;
    Record holding time for $\zeta_{i-1}$: $h_{i-1} \leftarrow \min(w_L, w_F, w_R)$;
    Set $\zeta_i$ to whichever of $L\zeta$, $F\zeta$, $R\zeta$ had the shortest waiting time;
end
Resample all $\zeta_i$ using holding times $h_i$ as importance weights;
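Algorithm 1 can be sketched in Python as follows. This is a simplified illustration, not the released package: it assumes unit Gaussian momentum, recomputes $L^{-1}\zeta$ on every step instead of caching it, and uses holding-time-weighted averages in place of the final resampling step. The function names and the Gaussian test target are assumptions.

```python
import numpy as np

def leapfrog(x, v, grad_E, eps, M):
    """Leapfrog operator L: M steps of size eps."""
    v = v - 0.5 * eps * grad_E(x)
    for step in range(M):
        x = x + eps * v
        v = v - (0.5 if step == M - 1 else 1.0) * eps * grad_E(x)
    return x, v

def mjhmc(E, grad_E, x0, eps, M, beta, n_steps, rng):
    """Return visited positions and their holding times."""
    def H(x, v):
        return E(x) + 0.5 * np.dot(v, v)    # assumes unit Gaussian momentum

    x = np.asarray(x0, dtype=float)
    v = rng.standard_normal(len(x))
    positions = np.empty((n_steps, len(x)))
    holding_times = np.empty(n_steps)
    for i in range(n_steps):
        x_fwd, v_fwd = leapfrog(x, v, grad_E, eps, M)     # L zeta
        x_bwd, v_bwd = leapfrog(x, -v, grad_E, eps, M)    # L F zeta
        # pi(L^{-1} zeta) = pi(F L F zeta) = pi(L F zeta): no final flip needed.
        h = H(x, v)
        rate_L = np.exp(0.5 * (h - H(x_fwd, v_fwd)))
        rate_back = np.exp(0.5 * (h - H(x_bwd, v_bwd)))
        rates = np.array([rate_L, max(0.0, rate_back - rate_L), beta])
        waits = np.full(3, np.inf)
        active = rates > 0
        waits[active] = rng.exponential(1.0 / rates[active])
        move = int(np.argmin(waits))
        positions[i], holding_times[i] = x, waits[move]
        if move == 0:
            x, v = x_fwd, v_fwd                  # leapfrog transition
        elif move == 1:
            v = -v                               # momentum flip
        else:
            v = rng.standard_normal(len(x))      # momentum randomization
    return positions, holding_times

# Hypothetical test target: 2-D standard Gaussian.
E = lambda x: 0.5 * np.dot(x, x)
grad_E = lambda x: x
positions, holding_times = mjhmc(E, grad_E, np.zeros(2), eps=0.5, M=5,
                                 beta=0.1, n_steps=10_000,
                                 rng=np.random.default_rng(0))
weights = holding_times / holding_times.sum()
mean = weights @ positions
var = weights @ (positions - mean) ** 2
print(mean, var)
```

For this target the weighted mean should be near 0 and the weighted variance near 1 in each dimension; resampling the visited states with probabilities `weights` would give unweighted samples instead.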


3 Experimental results 

3.1 Spectral gap on HMC state ladder 

The convergence rate of a Markov process to its steady state is given by its spectral gap [27]. This is the difference in the magnitude of the two largest eigenvalues. We numerically compute this value for randomly generated toy problems in order to compare our mixing rate to that of standard HMC. As all HMC algorithms randomize momentum in nearly the same way, it is expected that their mixing time over a single state ladder is representative of their mixing time over the entire state space. To achieve analytic tractability we restrict our attention to finite state ladders. To avoid edge effects, we attach the top and bottom rungs of the ladder to each other, so that the ladder forms a loop and $L^k \zeta = \zeta$, where $k$ is the number of distinct rungs. We evaluate the eigenvalues on each state ladder using the similarity relationship to a discrete time Markov chain in Equation 14.

A comparison of such spectral gaps between Markov Jump HMC and standard HMC is illustrated in Figure 3 as a function of state ladder size. We draw the energy for each ‘rung’ of the state space ladder from a unit norm Gaussian distribution, and average across 250 draws for each ladder size. Figure 3 thus shows performance averaged over many randomly generated energy landscapes. MJHMC mixes faster (has a larger spectral gap) for all except the smallest state space sizes.

Figure 3: Comparison of mixing performance of Markov jump HMC (MJHMC) and standard discrete time HMC. Spectral gap versus size of state ladder. For large state ladder sizes, MJHMC is better by half an order of magnitude.

Figure 4: Autocorrelation versus number of gradient evaluations for standard HMC and MJHMC for the Rough Well distribution. The hyperparameters found by Spearmint for MJHMC are $\epsilon = 3.0$, $\beta = 0.012314$, $M = 25$ and for control HMC $\epsilon = 0.591686$, $\beta = 0.429956$, $M = 25$.


3.2 Autocorrelation on rough well 
distribution 

Explicit computation of mixing time for most problems is computationally intractable. It is common to instead use the rate at which the sample autocorrelation approaches zero as a proxy. As illustrated in Figure 4, we compare autocorrelation traces for MJHMC with standard HMC and NUTS on the rough well distribution, and find that MJHMC performs significantly better.


3.2.1 Energy function 

The chosen energy function was the ‘rough well’ distribution from [10]. This distribution provides a good test case because it is as simple as possible, while also presenting both well understood and significant challenges to HMC-style samplers. Its energy function is

$$E(x) = \frac{1}{2\sigma_1^2}\left(x_1^2 + x_2^2\right) + \cos\left(\frac{\pi x_1}{\sigma_2}\right) + \cos\left(\frac{\pi x_2}{\sigma_2}\right),$$

where $\sigma_1 = 100$ and $\sigma_2 = 4$. Although this distribution is well conditioned everywhere, the sinusoids cause it to have a ‘rough’ surface, such that it requires many leapfrog steps to traverse the quadratic well while maintaining a reasonable discretization error.
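The rough well energy and its gradient, which any HMC-style sampler needs, can be written down directly from the expression above. The function names below are illustrative, and the finite-difference comparison is only a sanity check on the hand-derived gradient.

```python
import numpy as np

SIGMA_1, SIGMA_2 = 100.0, 4.0

def rough_well_energy(x):
    """E(x) = |x|^2 / (2 sigma_1^2) + sum_i cos(pi x_i / sigma_2)."""
    x = np.asarray(x, dtype=float)
    return np.sum(x ** 2) / (2 * SIGMA_1 ** 2) + np.sum(np.cos(np.pi * x / SIGMA_2))

def rough_well_grad(x):
    """Gradient of the rough well energy with respect to x."""
    x = np.asarray(x, dtype=float)
    return x / SIGMA_1 ** 2 - (np.pi / SIGMA_2) * np.sin(np.pi * x / SIGMA_2)

# Central finite differences at an arbitrary test point.
x = np.array([3.0, -7.5])
h = 1e-6
numeric_grad = np.array([
    (rough_well_energy(x + h * e) - rough_well_energy(x - h * e)) / (2 * h)
    for e in np.eye(2)
])
print(rough_well_grad(x), numeric_grad)
```

The quadratic term has curvature $1/\sigma_1^2 = 10^{-4}$ while the cosine terms have curvature of order $(\pi/\sigma_2)^2 \approx 0.6$, which is why a step size small enough for the ripples needs many steps to cross the well.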


3.2.2 Hyperparameter selection 


4 Discussion 

We introduced an algorithm, Markov Jump Hamiltonian Monte Carlo (MJHMC), in which the state transitions in Hamiltonian Monte Carlo sampling occur as Poisson processes in continuous time, rather than at discrete time steps. We demonstrated that this algorithm led to improved mixing performance, as measured by explicit computation of the spectral gap and by the autocorrelation of the sampler on a simple but challenging distribution.
A Transition rates satisfy balance condition


The continuous time balance condition states that, at the steady state distribution, there is no net change in the 
probability of states. 


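$$\left.\frac{dp(\zeta)}{dt}\right|_{p(\zeta)=\pi(\zeta)} = 0. \tag{19}$$

In order to demonstrate that we satisfy the balance condition, we evaluate $\left.\frac{dp(\zeta)}{dt}\right|_{p(\zeta)=\pi(\zeta)}$ using the transition rates from Equation 9,

\begin{align}
\left.\frac{dp(\zeta)}{dt}\right|_{p(\zeta)=\pi(\zeta)} = {} & -\pi(\zeta)\,\Gamma(L\zeta \mid \zeta) - \pi(\zeta)\,\Gamma(F\zeta \mid \zeta) - \beta\,\pi(\zeta) \nonumber \\
& + \pi(L^{-1}\zeta)\,\Gamma(\zeta \mid L^{-1}\zeta) + \pi(F\zeta)\,\Gamma(\zeta \mid F\zeta) + \beta\,\pi(v) \int dv'\, \pi(x, v'), \tag{20}
\end{align}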


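where negative terms correspond to probability flow out of state $\zeta$ into other states, and positive terms correspond to probability flow from other states into state $\zeta$. There are only a small number of terms because transitions are only allowed to/from a limited set of states. We now proceed to simplify and cancel terms, using $\pi(F\zeta) = \pi(\zeta)$, $\pi(LF\zeta) = \pi(L^{-1}\zeta)$, $\pi(L^{-1}F\zeta) = \pi(L\zeta)$, and $\int dv'\, \pi(x, v') = \pi(x)$.

\begin{align}
\left.\frac{dp(\zeta)}{dt}\right|_{p(\zeta)=\pi(\zeta)} = {} & -\pi(\zeta)\left[\frac{\pi(L\zeta)}{\pi(\zeta)}\right]^{\frac{1}{2}} + \pi(L^{-1}\zeta)\left[\frac{\pi(\zeta)}{\pi(L^{-1}\zeta)}\right]^{\frac{1}{2}} \nonumber \\
& - \pi(\zeta)\max\left[0,\, \left[\frac{\pi(L^{-1}\zeta)}{\pi(\zeta)}\right]^{\frac{1}{2}} - \left[\frac{\pi(L\zeta)}{\pi(\zeta)}\right]^{\frac{1}{2}}\right] \nonumber \\
& + \pi(F\zeta)\max\left[0,\, \left[\frac{\pi(L^{-1}F\zeta)}{\pi(F\zeta)}\right]^{\frac{1}{2}} - \left[\frac{\pi(LF\zeta)}{\pi(F\zeta)}\right]^{\frac{1}{2}}\right] - \beta\,\pi(\zeta) + \beta\,\pi(\zeta) \tag{21} \\
= {} & -\left[\pi(\zeta)\pi(L\zeta)\right]^{\frac{1}{2}} + \left[\pi(\zeta)\pi(L^{-1}\zeta)\right]^{\frac{1}{2}} \nonumber \\
& - \max\left[0,\, \left[\pi(\zeta)\pi(L^{-1}\zeta)\right]^{\frac{1}{2}} - \left[\pi(\zeta)\pi(L\zeta)\right]^{\frac{1}{2}}\right] + \max\left[0,\, \left[\pi(\zeta)\pi(L\zeta)\right]^{\frac{1}{2}} - \left[\pi(\zeta)\pi(L^{-1}\zeta)\right]^{\frac{1}{2}}\right] \tag{22} \\
= {} & -\left[\pi(\zeta)\pi(L\zeta)\right]^{\frac{1}{2}} + \left[\pi(\zeta)\pi(L^{-1}\zeta)\right]^{\frac{1}{2}} - \left(\left[\pi(\zeta)\pi(L^{-1}\zeta)\right]^{\frac{1}{2}} - \left[\pi(\zeta)\pi(L\zeta)\right]^{\frac{1}{2}}\right) \tag{23} \\
= {} & 0. \tag{24}
\end{align}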

Therefore the transition rates in MJHMC satisfy the balance condition for $\pi(\zeta)$, as claimed.



Manuscript under review by AISTATS 2016 


B Hyperparameter search 


B.1 Demonstration of optimized hyperparameters

[Plot; x-axis: Gradient Evaluations]

Figure A.1: Comparison of mixing performance of MJHMC and standard discrete-time HMC with both samplers set to the same hyperparameters. (a) Both samplers set to $\epsilon = 0.591686$, $\beta = 0.429956$, $M = 25$, the best setting for standard HMC found by Spearmint. (b) Both samplers set to $\epsilon = 3.0$, $\beta = 0.012314$, $M = 25$, the best setting for MJHMC found by Spearmint.

The autocorrelation data illustrated in Figure A.1 demonstrate that Spearmint found effective hyperparameters for MJHMC and standard discrete-time HMC on our chosen energy function. Each sampler outperforms the other when both are set to its own optimized hyperparameters.
B.2 Illustration of Spearmint search

Figure A.2: Search performance projected onto the $\beta$ axis.

Figure A.3: Search performance projected onto the $\epsilon$ axis.

Figure A.4: Search performance projected onto the $M$ axis.


Figures A.2, A.3, and A.4 illustrate the overall structure of Spearmint’s search for hyperparameters. The green stars represent trial settings of MJHMC hyperparameters and the red crosses represent trial settings of standard HMC hyperparameters. The y-axis represents the value of the objective function for each trial setting. It can be seen in Figure A.2 that MJHMC chooses smaller $\beta$ values, which suggests that it prefers to corrupt momentum more slowly than the control case. It can also be seen from ?? that MJHMC prefers larger step lengths for the integrator ($\epsilon$) and steps ($M$).


C Derivation of Equation 11

First we calculate $P(\tau_2 < \tau_1)$, where $\tau_i$ is drawn from $\mathrm{Exp}(\lambda_i)$ for $i = 1, 2$:

$$P(\tau_2 < \tau_1) = \int_0^\infty P(\tau_1 = t)\, P(\tau_2 < t)\, dt$$

Let $\tau_{ij}$ be drawn from $\mathrm{Exp}(\Gamma_{ij})$. Then

$$P(\zeta_j \mid \zeta_i) = P\left(\tau_{i,j} = \min\{\tau_{i,1}, \tau_{i,2}, \ldots, \tau_{i,n}\}\right)$$
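The quantity being derived here is the probability that one exponential clock fires before its competitors. A standard property of independent exponential variables is that clock $j$ wins with probability equal to its rate divided by the sum of all rates, and that the minimum waiting time has mean one over that sum. The Monte Carlo check below illustrates this; the three rates are arbitrary values chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
rates = np.array([0.5, 1.5, 3.0])       # hypothetical outgoing rates
n_draws = 200_000

# One waiting time per candidate transition, per trial.
waits = rng.exponential(1.0 / rates, size=(n_draws, rates.size))
winners = np.argmin(waits, axis=1)

empirical = np.bincount(winners, minlength=rates.size) / n_draws
predicted = rates / rates.sum()          # rate of clock j over total exit rate
mean_holding_time = waits.min(axis=1).mean()

print(empirical, predicted, mean_holding_time)
```

The ratio in `predicted` has the same form as the off-diagonal entries of $\tilde{T}$ in Equation 11, and `mean_holding_time` corresponds to the diagonal entries of $D$.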